This post uses data from the for Data Science online learning community. Each Tuesday, a dataset is posted and people share their insights and visualizations like fine wine. You can Twitter the hashtag #TidyTuesday to sample the vintage of the week. This week’s dataset (September 1st, 2020) is from Our World in Data. There are actually five datasets but we are going to focus on global crop yields.
After exploring the data, we will create a model that predict the yield per hectare (yield_hectare ~ population + year), using the year and population. The dataset is too big to use every country and crop we will choose the top 25 countries by population and select 4 common crops. Altother, that will produce 100 linear models. Since that’s too much to digest in a table, we will present the model results as a labeled data visualization where each point represents a model’s adjusted p-value (y-axis) and estimate or slope (x-axis) for a given country and crop.
Let’s start by downloading the data from the R4DS Github
account and take a look at it. The tidytuesdayR package (by Ellis
Hughes, co-host of the screencast ‘TidyX’) is a handy package for
checking out available datasets, exploring the data dictionary in the
Rstudio viewer and loading the desired data (and more). Although simple
to use, I’m going to stick with the readr::read_csv() to
import the data as a tibble (Tidyverse lingo for a modern
dataframe object).
# Load the data with a readr function
crop_yields <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-01/key_crop_yields.csv") %>%
janitor::clean_names()
land_use <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-01/land_use_vs_yield_change_in_cereal_production.csv") %>%
janitor::clean_names() %>%
select(entity, code, year, population = total_population_gapminder) %>%
mutate(year = as.numeric(year))
crop_yield <- crop_yields %>%
left_join(land_use, by = c('code', 'year', 'entity'))
skimr::skim(crop_yields)
| Name | crop_yields |
| Number of rows | 13075 |
| Number of columns | 14 |
| _______________________ | |
| Column type frequency: | |
| character | 2 |
| numeric | 12 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| entity | 0 | 1.00 | 4 | 39 | 0 | 249 | 0 |
| code | 1919 | 0.85 | 3 | 8 | 0 | 214 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1.00 | 1990.37 | 16.73 | 1961.00 | 1976.00 | 1991.00 | 2005.00 | 2018.00 | ▇▆▇▇▇ |
| wheat_tonnes_per_hectare | 4974 | 0.62 | 2.43 | 1.69 | 0.00 | 1.23 | 1.99 | 3.12 | 10.67 | ▇▅▂▁▁ |
| rice_tonnes_per_hectare | 4604 | 0.65 | 3.16 | 1.85 | 0.20 | 1.77 | 2.74 | 4.16 | 10.68 | ▇▇▃▁▁ |
| maize_tonnes_per_hectare | 2301 | 0.82 | 3.02 | 3.13 | 0.03 | 1.14 | 1.83 | 3.92 | 36.76 | ▇▁▁▁▁ |
| soybeans_tonnes_per_hectare | 7114 | 0.46 | 1.45 | 0.75 | 0.00 | 0.86 | 1.33 | 1.90 | 5.95 | ▇▇▂▁▁ |
| potatoes_tonnes_per_hectare | 3059 | 0.77 | 15.40 | 9.29 | 0.84 | 8.64 | 13.41 | 20.05 | 75.30 | ▇▅▁▁▁ |
| beans_tonnes_per_hectare | 5066 | 0.61 | 1.09 | 0.82 | 0.03 | 0.59 | 0.83 | 1.35 | 9.18 | ▇▁▁▁▁ |
| peas_tonnes_per_hectare | 6840 | 0.48 | 1.48 | 1.01 | 0.04 | 0.72 | 1.15 | 1.99 | 7.16 | ▇▃▁▁▁ |
| cassava_tonnes_per_hectare | 5887 | 0.55 | 9.34 | 5.11 | 1.00 | 5.55 | 8.67 | 11.99 | 38.58 | ▇▇▁▁▁ |
| barley_tonnes_per_hectare | 6342 | 0.51 | 2.23 | 1.50 | 0.09 | 1.05 | 1.88 | 3.02 | 9.15 | ▇▆▂▁▁ |
| cocoa_beans_tonnes_per_hectare | 8466 | 0.35 | 0.39 | 0.28 | 0.00 | 0.24 | 0.36 | 0.49 | 3.43 | ▇▁▁▁▁ |
| bananas_tonnes_per_hectare | 4166 | 0.68 | 15.20 | 12.08 | 0.66 | 5.94 | 11.78 | 20.79 | 77.59 | ▇▃▁▁▁ |
# Here are a few other options for getting to know the data
# glimpse(crop_yields)
# summary(crop_yields)
# View(crop_yields)
Let’s start with a little data cleaning! skim() from
skimr provides a convenient way to get summary statistics
and more. By looking at the data we can see that there are missing
values that need to be removed or imputed but other than that, the data
is clean. One thing we should consider is whether the data is in a tidy
format where each column is a feature and each row a sample. This format
is important when we want to model the data but sometimes it’s helpful
to pivot the data into a long format for creating visualizations across
multiple categories with ggplot2::facet_wrap() and related
functions. We will use the handy function, pivot_longer(),
from the tidyr package to combine crops into a single column and present
their respective values in a separate column in preparation for
exploratory data analysis (EDA).
# Use pivot_longer from tidyr to tidy the data.
crop_yield_tidy <- crop_yield %>%
rename_all(str_remove, "_tonnes.*") %>%
rename(country = entity) %>%
pivot_longer(wheat:bananas, names_to = 'crop', values_to = 'yield_hectare') %>%
drop_na(yield_hectare)
# Create a vector of the 25 countries with the largest population
top_pops <- land_use %>%
filter(!is.na(code), entity != "World") %>%
group_by(entity) %>%
filter(year == max(year)) %>%
ungroup() %>%
slice_max(population, n = 25) %>%
pull(entity)
kable(head(crop_yield_tidy, n = 10))
| country | code | year | population | crop | yield_hectare |
|---|---|---|---|---|---|
| Afghanistan | AFG | 1961 | 9169000 | wheat | 1.0220 |
| Afghanistan | AFG | 1961 | 9169000 | rice | 1.5190 |
| Afghanistan | AFG | 1961 | 9169000 | maize | 1.4000 |
| Afghanistan | AFG | 1961 | 9169000 | potatoes | 8.6667 |
| Afghanistan | AFG | 1961 | 9169000 | barley | 1.0800 |
| Afghanistan | AFG | 1962 | 9351000 | wheat | 0.9735 |
| Afghanistan | AFG | 1962 | 9351000 | rice | 1.5190 |
| Afghanistan | AFG | 1962 | 9351000 | maize | 1.4000 |
| Afghanistan | AFG | 1962 | 9351000 | potatoes | 7.6667 |
| Afghanistan | AFG | 1962 | 9351000 | barley | 1.0800 |
A great place to start any analysis is by counting the representation
of certain categories in the dataset using the count() or
add_count() function. For instance, we might want to know
the count of each country across years which will tell us the number of
crops recorded for each country. It often makes sense to combine
count() with a bar plot to visualize meaningful
distributions.
EDA is often where I have the most fun in my workflow..plus it’s arguably the most important step in a data analysis pipeline. Among other things, it provides insights that will help us have an intuitive sense about what to expect when we generate machine learning models and evaluate their performance.
crop_yield_tidy %>%
add_count(year, country, sort = T) %>%
filter(country %in% c("Albania", "Africa", "United States")) # Africa has 11 crops, US has 9 and Albania 6.
## # A tibble: 1,522 × 7
## country code year population crop yield_hectare n
## <chr> <chr> <dbl> <dbl> <chr> <dbl> <int>
## 1 Africa <NA> 1961 290214016 wheat 0.693 11
## 2 Africa <NA> 1961 290214016 rice 1.55 11
## 3 Africa <NA> 1961 290214016 maize 1.04 11
## 4 Africa <NA> 1961 290214016 soybeans 0.376 11
## 5 Africa <NA> 1961 290214016 potatoes 8.18 11
## 6 Africa <NA> 1961 290214016 beans 0.587 11
## 7 Africa <NA> 1961 290214016 peas 0.679 11
## 8 Africa <NA> 1961 290214016 cassava 5.66 11
## 9 Africa <NA> 1961 290214016 barley 0.446 11
## 10 Africa <NA> 1961 290214016 cocoa_beans 0.254 11
## # … with 1,512 more rows
crop_yield_tidy %>%
count(country, sort = TRUE) %>%
filter(n >=500) # this gives a list of the countries with the most years + crops 11 (crops) * 51 (year) = 561
## # A tibble: 53 × 2
## country n
## <chr> <int>
## 1 Africa 638
## 2 Americas 638
## 3 Asia 638
## 4 Central America 638
## 5 Democratic Republic of Congo 638
## 6 Eastern Africa 638
## 7 Ecuador 638
## 8 Land Locked Developing Countries 638
## 9 Least Developed Countries 638
## 10 Low Income Food Deficit Countries 638
## # … with 43 more rows
crop_yield_tidy %>%
summarize(max(year)- min(year)) #57 years of recorded data
## # A tibble: 1 × 1
## `max(year) - min(year)`
## <dbl>
## 1 57
crop_yield_tidy %>% # 249 unique countries
distinct(country) %>%
nrow()
## [1] 249
Let’s start by comparing crop yield by country for the United States to see how crop yields have changed over time. While we’re at it, let’s look at the most abundant crop yields overall.
# Potatoes!
us_yields <- crop_yield_tidy %>%
filter(country == 'United States') %>%
ggplot(aes(year, yield_hectare, color = crop)) +
geom_line() +
facet_wrap(~country, 'free_x') +
theme(plot.title.position = 'plot') +
labs(
title = 'Crop Yield USA',
y = 'Year',
x = 'tonnes per hectare per year'
)
plotly::ggplotly(us_yields)
# plot of USA data only
crop_yield_tidy %>%
filter(country == 'United States') %>%
mutate(crop = fct_reorder(crop, yield_hectare)) %>%
ggplot(aes(yield_hectare, crop, fill = crop)) +
geom_col() +
facet_wrap(~country, scales = 'free_y') +
theme(legend.position = 'none',
plot.title.position = 'plot') +
labs(
title = 'Crop Yield USA',
y = 'Crop',
x = 'tonnes per hectare'
)
There are too many countries to add to a single plot, so we will
likely need to decide how we want to subset the data. For this analysis,
let’s use dplyr::slice_sample() to randomly select a subset
of the total countries. Unless a seed is set, the selection will be
different each time we run the code chunk. To start, let’s choose 9
countries and visualize the change in crop yields over time.
# set.seed(1014) if you want to select the same 9 countries each time
# Randomly select 9 country names
crops <- crop_yield_tidy %>% distinct(crop) %>% pull()
country_random <- crop_yield_tidy %>%
select(country) %>%
distinct() %>%
slice_sample(n = 9) %>%
pull()
# Plot crop yield per hectare per year.
crop_yield_tidy %>%
filter(country %in% country_random,
crop %in% c('maize', 'potatoes', 'wheat', 'bananas')) %>%
mutate(crop = fct_reorder(crop, yield_hectare)) %>%
ggplot(aes(year, yield_hectare, color = crop)) +
geom_line(size = 1, alpha = 0.5) +
facet_wrap(~country, scales = 'free_y') +
theme(plot.title.position = 'plot') +
labs(
y = 'yield per hectare',
title = 'Crop Yield'
)
crop_yield_tidy %>%
filter(country %in% top_pops,
crop == c('maize', 'potatoes', 'wheat', 'bananas')) %>%
mutate(crop = reorder_within(crop, yield_hectare, country)) %>%
ggplot(aes(yield_hectare, crop, fill = crop)) +
geom_col() +
scale_y_reordered() +
facet_wrap(~country, scales = 'free') +
theme(legend.position = 'none',
plot.title.position = 'plot') +
labs(
title = 'Crop Yield',
y = 'Crop',
x = 'yield per hectare'
)
crop_yield_tidy %>%
filter(country %in% top_pops,
crop %in% c('maize', 'potatoes', 'wheat', 'bananas')) %>%
mutate(crop = fct_reorder(crop, yield_hectare)) %>%
ggplot(aes(year, yield_hectare, color = crop)) +
geom_line(size = 1, alpha = 0.5) +
theme(plot.title.position = 'plot') +
facet_wrap(~country, scales = 'free_y') +
labs(
y = 'yield per hectare',
title = 'Crop Yield'
)
Next, let’s make a regression model for each of the top 25 countries.
library(broom)
library(ggrepel)
many_models <- crop_yield_tidy %>%
drop_na() %>%
filter(country %in% top_pops,
crop %in% c('maize', 'potatoes', 'wheat', 'bananas')) %>%
nest(mod_data = c(year, yield_hectare, population)) %>%
mutate(model = map(mod_data, ~lm(yield_hectare ~ population + year, data = .))) %>%
mutate(coefs = map(model, tidy)) %>%
unnest(coefs)
Here’s what the data looks like after unnesting the coefficients.
many_models %>%
head()
## # A tibble: 6 × 10
## country code crop mod_data model term estimate std.error
## <chr> <chr> <chr> <list> <list> <chr> <dbl> <dbl>
## 1 Bangladesh BGD wheat <tibble> <lm> (Intercep… -4.02e+2 4.99e+1
## 2 Bangladesh BGD wheat <tibble> <lm> population -8.05e-8 1.23e-8
## 3 Bangladesh BGD wheat <tibble> <lm> year 2.07e-1 2.57e-2
## 4 Bangladesh BGD maize <tibble> <lm> (Intercep… 5.31e+2 2.63e+2
## 5 Bangladesh BGD maize <tibble> <lm> population 1.90e-7 6.47e-8
## 6 Bangladesh BGD maize <tibble> <lm> year -2.76e-1 1.36e-1
## # … with 2 more variables: statistic <dbl>, p.value <dbl>
Finally, let’s make the results more meaningful by creating a visualization to summarize the model results.
many_models %>%
filter(term == "year") %>%
mutate(p.value = p.adjust(p.value)) %>%
ggplot(aes(estimate, p.value, label = country)) +
geom_vline(xintercept = 0, lty = 2,
size = 1.5, alpha = 0.7, color = "gray50") +
geom_point(aes(color = crop), alpha = 0.8, size = 2.5, show.legend = FALSE) +
scale_y_log10() +
facet_wrap(~crop) +
geom_text_repel(size = 2.5, family = "Times New Roman")